Support deepspeed sequence parallel #31525

Open
wants to merge 15 commits into main

Conversation

@zeyugao commented Jun 20, 2024

What does this PR do?

Support sequence parallelism with DeepSpeed-Ulysses.

I have tested the training on starcoder2-3b. The loss decreases normally.

[Screenshot: training loss curve]

Requires huggingface/accelerate#2877

I have made substantial modifications to the original DeepSpeed-Ulysses implementation in layers.py to support the batch size dimension. It uses all_to_all_single instead of all_to_all (as in https://github.com/InternLM/InternEvo/blob/a61d391df96c5f5c243cdea32a5044b70d6fe33e/internlm/core/parallel/comm/isp.py#L628) for better performance, and I have left some comments to aid future understanding. With all_to_all_single, supporting other scatter/gather indices would be too complex.
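
For context, here is a minimal sketch of an Ulysses-style redistribution built on a single all_to_all_single call. This is a hypothetical standalone helper, not the PR's layers.py; it assumes equal splits, one sequence-parallel process group, and a head count divisible by the sequence-parallel degree.

    import torch
    import torch.distributed as dist

    def seq_to_head_all_to_all(x: torch.Tensor, group) -> torch.Tensor:
        # Redistribute [b, s/p, n, h] -> [b, s, n/p, h] over the sequence-parallel group.
        p = dist.get_world_size(group)
        b, s_local, n, h = x.shape
        # Split the head dim into p chunks and move the chunk index to dim 0;
        # all_to_all_single scatters dim-0 chunks across ranks.
        x = x.reshape(b, s_local, p, n // p, h).permute(2, 1, 0, 3, 4).contiguous()
        out = torch.empty_like(x)
        dist.all_to_all_single(out, x, group=group)
        # After the exchange, dim 0 indexes the p sequence shards of the full sequence.
        return out.permute(2, 0, 1, 3, 4).reshape(b, s_local * p, n // p, h)

Attention then runs over the full sequence with n/p heads per rank, and the inverse all-to-all restores the [b, s/p, n, h] layout afterwards.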

Currently, flash attention and SDPA are tested for llama and mistral. Flash attention is also tested for starcoder; SDPA for starcoder is not supported.

It requires a special dataloader (implemented in the Trainer) and a data collator (example below). In the data collator, the sequence should be divided into multiple sub-sequences. The following is an example of sub-sequence processing in the data collator:

            seq_parallel_world_size = mpu.get_sequence_parallel_world_size()
            seq_parallel_world_rank = mpu.get_sequence_parallel_rank()

            seq_length = input_ids.size(1)
            sub_seq_length = seq_length // seq_parallel_world_size
            sub_seq_start = seq_parallel_world_rank * sub_seq_length
            sub_seq_end = (seq_parallel_world_rank + 1) * sub_seq_length

            # There is no kv cache when training
            past_key_values_length = 0

            position_ids = torch.arange(
                past_key_values_length, seq_length + past_key_values_length, dtype=torch.long,
            )
            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)

            batch = dict(
                input_ids=input_ids[:, sub_seq_start:sub_seq_end],
                labels=labels[:, sub_seq_start:sub_seq_end],
                position_ids=position_ids[:, sub_seq_start:sub_seq_end],
                attention_mask=(input_ids != self.tokenizer.pad_token_id),
            )
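
As a self-contained variant of the snippet above, the collator could call a helper like the following after padding. This is a hypothetical function, not part of the PR; it assumes the padded length is a multiple of the sequence-parallel degree and that mpu is the same parallel-state module used above.

    import torch

    def shard_batch_for_sequence_parallel(input_ids, labels, pad_token_id, mpu):
        # Keep only this rank's contiguous sub-sequence; position_ids stay global.
        sp_size = mpu.get_sequence_parallel_world_size()
        sp_rank = mpu.get_sequence_parallel_rank()
        seq_length = input_ids.size(1)
        assert seq_length % sp_size == 0, "pad to a multiple of the sequence-parallel degree"
        sub_len = seq_length // sp_size
        start, end = sp_rank * sub_len, (sp_rank + 1) * sub_len
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        return dict(
            input_ids=input_ids[:, start:end],
            labels=labels[:, start:end],
            position_ids=position_ids[:, start:end],
            # The attention mask is kept at full sequence length, as in the example above.
            attention_mask=(input_ids != pad_token_id),
        )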

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr and @SunMarc

@zeyugao marked this pull request as draft June 21, 2024 02:25
@zeyugao marked this pull request as ready for review June 21, 2024 10:21
@fan-niu commented Jun 27, 2024

Great! Can you provide an example of data processing based on sequence parallelism? Thanks.

@zeyugao (Author) commented Jun 27, 2024

The dataset and sampler are handled in the Trainer

https://github.com/huggingface/transformers/pull/31525/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR847-R855

The data collator example was accidentally deleted while editing; here it is again:

            seq_parallel_world_size = mpu.get_sequence_parallel_world_size()
            seq_parallel_world_rank = mpu.get_sequence_parallel_rank()

            seq_length = input_ids.size(1)
            sub_seq_length = seq_length // seq_parallel_world_size
            sub_seq_start = seq_parallel_world_rank * sub_seq_length
            sub_seq_end = (seq_parallel_world_rank + 1) * sub_seq_length

            # There is no kv cache when training
            past_key_values_length = 0

            position_ids = torch.arange(
                past_key_values_length, seq_length + past_key_values_length, dtype=torch.long,
            )
            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)

            batch = dict(
                input_ids=input_ids[:, sub_seq_start:sub_seq_end],
                labels=labels[:, sub_seq_start:sub_seq_end],
                position_ids=position_ids[:, sub_seq_start:sub_seq_end],
                attention_mask=(input_ids != self.tokenizer.pad_token_id),
            )

@github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions bot closed this Aug 18, 2024
@SunMarc reopened this Aug 26, 2024
@ArthurZucker added the Feature request and DeepSpeed labels Aug 27, 2024
@ldh127 commented Sep 22, 2024

How long will it take for this PR to be merged? When will it be finished?

@LysandreJik (Member) commented
cc @SunMarc if you have the bandwidth to take a look!

@glowwormX commented
@zeyugao I carefully read your pull requests for transformers and accelerate and pulled your code to try training. I have now run into a problem: when entering DistributedAttention, the q, k, v before _SeqAllToAll.apply are not [b, s/p, n, h] but still [b, s, n, h]. I checked the modified data-processing parts, such as accelerate/data_loader.py and transformers/trainer.py, but did not find any relevant code. So, may I ask where the sequence splitting is done?

@zeyugao (Author) commented Oct 8, 2024

@glowwormX It is in the PR description:

[Screenshot of the data collator example from the PR description]

@glowwormX commented
@zeyugao My god, I missed it; I thought this code was in the PR. Thank you for replying.

@glowwormX commented
@zeyugao Have you compared the loss with and without sequence parallelism? After adding a fixed seed to the DistributedSampler, the training data is the same. I modified trainer.py as follows:

        if is_accelerate_available() and mpu.sequence_parallel_is_enabled():
            assert self.args.group_by_length is False, "Group by length is not supported with sequence parallel."
            return DistributedSampler(
                dataset=self.train_dataset,
                num_replicas=mpu.get_data_parallel_world_size(),
                rank=mpu.get_data_parallel_rank(),
                shuffle=True,
                seed=42
            )

However, on the same data, the average loss with sequence parallelism differs from the loss without it.

In addition, why is SDPA not supported for starcoder? I am trying to modify qwen2 and do not know whether SDPA is unsupported there as well.

@zeyugao (Author) commented Oct 19, 2024

@glowwormX The main reason should be that it needs a custom loss calculation; otherwise some tokens (at the head and tail of each sub-sequence) do not contribute to the final loss: https://github.com/microsoft/DeepSpeed/pull/5774/files#diff-13f25bb51b0f4019d8cb09c07204a33510dca5dccfae736baf10134f893704d5
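
To illustrate the boundary effect, here is a sketch of the standard per-rank causal LM shift (not the PR's code): with the shift applied independently inside each sub-sequence, the last position of every sub-sequence has no local target and the first label of every sub-sequence is never predicted locally, unless logits or labels are exchanged across the sequence-parallel group.

    import torch.nn.functional as F

    def local_causal_lm_loss(logits, labels):
        # Stock per-rank shift: drops the prediction at the sub-sequence's last position
        # and never predicts the sub-sequence's first label.
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = labels[:, 1:].contiguous()
        return F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
        )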

the reason why starcoder does not support sdpa

I did not have enough spare time at the time to get the shapes correct when using SDPA for starcoder2.

@ronald-d-rogers commented
@zeyugao: Your implementation does not use this loss function, right? Does it still work OK even so?

Labels: DeepSpeed, Feature request
8 participants